    Finding Non-Trivially Similar Documents in a Large Document Corpus (Mittetriviaalselt sarnaste dokumentide otsimine suurest dokumentide korpusest)

    The aim of this master's thesis is to study how to find non-trivially similar documents in a large collection of documents. The thesis describes traditional methods for studying document similarity and also introduces new ones. In addition, experiments are carried out to study how the proposed measures behave on data. Traditional document similarity methods measure the occurrence of similar words in two documents. In this thesis we discuss the problems that arise when a document similarity measure is computed using only the words found in the documents themselves, introduce existing measures, and propose new ones for overcoming these problems. Documents are non-trivially similar if they share few words but are contextually similar. To identify the context of documents, we propose the concept of a background graph. The purpose of the background graph is to model the relationships between words, that is, concepts, giving more weight to words that often occur together. We use the resulting background graph to compute various inter-document similarity measures. The thesis also addresses the relationship between user behaviour and similarity measures. It gives a brief overview of the basic concepts of sequence mining and uses them to study how different similarity measures model user behaviour. Various experiments are carried out on data from the news portal Postimees.ee. In studying the background graph, we see that the constructed graph describes within-context relationships between concepts very well. In studying the similarity measures, we see that for general news recommendation the traditional method works better than our proposed methods. Measures that use information from the background graph give better results than traditional methods when we use little but high-quality data about a document.
This master's thesis proposes a new method for finding document similarity, and we see that these methods work better in certain cases than previously used measures.
This thesis introduces methods used for measuring the similarity between documents. Document similarity measures are an important topic in information retrieval and in document classification systems. Finding similar documents in a document corpus is applicable in many different fields: web search engines, news aggregation services, advertising systems, and so on. An important requirement for a document similarity measure is that the similarity score should agree with human judgement of similarity. Here the problem of semantic similarity arises. The standard way to find similarity between documents is to compare the co-occurrence of words in them. Thus it is possible that two documents which are contextually very similar, but do not contain the same words, are marked dissimilar by the standard document similarity measures. The goal of semantic similarity measures is to take into account the context of the documents and use this information for measuring the similarity. The first goal of this thesis is to give an overview of different methods used for standard and for semantic document similarity. The second goal is to experiment with the document similarity measures on a news portal dataset and to explore whether we can find some interesting properties of those measures. The motivation for the topic originates from an idea to create a new advertising network which is able to target advertisements better than the networks currently on the market. The goal was to analyse whether we could find a simple, intuitive, yet effective method for finding the non-trivial similarity between documents.
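
The abstract above proposes a background graph that gives more weight to words that often occur together, and uses it to compare documents that share few words. The following is only a minimal sketch of that general idea, not the thesis's actual measures: the function names, the choice of edge weight (fraction of corpus documents in which two words co-occur), and the averaging rule are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_background_graph(corpus):
    """Map each unordered word pair to the fraction of documents containing both.

    Assumption: this edge weight is illustrative, not the thesis's definition.
    """
    pair_counts = Counter()
    for doc in corpus:
        # Sort the distinct words so each pair has one canonical key.
        pair_counts.update(combinations(sorted(set(doc)), 2))
    n = len(corpus)
    return {pair: count / n for pair, count in pair_counts.items()}


def edge_weight(graph, a, b):
    """Weight between two words; identical words are treated as fully related."""
    if a == b:
        return 1.0
    return graph.get((a, b) if a < b else (b, a), 0.0)


def graph_similarity(graph, doc_a, doc_b):
    """Mean edge weight over all word pairs drawn from the two documents."""
    words_a, words_b = set(doc_a), set(doc_b)
    if not words_a or not words_b:
        return 0.0
    total = sum(edge_weight(graph, a, b) for a in words_a for b in words_b)
    return total / (len(words_a) * len(words_b))
```

With such a graph, two documents with no words in common can still receive a non-zero score when their words frequently co-occur in the background corpus, which is the situation the abstract calls non-trivial similarity.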

    Word Associations as a Language Model for Generative and Creative Tasks

    In order to analyse natural language and gain a better understanding of documents, a common approach is to produce a language model that creates a structured representation of language, which can then be used for further analysis or generation. This thesis focuses on a fairly simple language model based on associations between words that appear together in the same sentence. We revisit a classic idea of analysing word co-occurrences statistically and propose a simple parameter-free method for extracting common word associations, i.e. associations between words that are often used in the same context (e.g., Batman and Robin). Additionally, we propose a method for extracting associations which are specific to a document or a set of documents. The idea behind the method is to take into account the common word associations and highlight those word associations which co-occur in the document unexpectedly often. We show empirically that these models can be used in practice for at least three tasks: generation of creative combinations of related words, document summarization, and creating poetry. First, the common word association language model is used for solving a test of creativity, the Remote Associates Test. Then observations of the properties of the model are used further to generate creative combinations of words, that is, sets of words which are mutually not related but do share a common related concept. Document summarization is a task where a system has to produce a short summary of the text with a limited number of words. In this thesis, we propose a method which utilises the document-specific associations and basic graph algorithms to produce summaries which give competitive performance on various languages. The document-specific associations are also used to produce poetry which is related to a certain document or a set of documents.
The idea is to use documents as inspiration for generating poems which could potentially be used as commentary to news stories. Empirical results indicate that both the common and the document-specific associations can be used effectively for different applications. This provides us with a simple language model which could be used for different languages.
Language models are often used for understanding natural languages and documents. A language model is a structured representation of language that can be used for analysing or generating language. This thesis presents a simple language model based on associations between words that occur in the same sentence. We first look at a classic method for analysing word co-occurrences statistically, on the basis of which we present a parameter-free method for producing common word associations. These word associations are connections between words that occur in the same contexts, such as Batman and Robin. In addition, we present a method that produces such associations for a specific document or set of documents. The method is based on taking into account those word associations that are particularly common in the source documents. We show empirically that our language models can be used for at least three purposes: producing creative word combinations, summarizing documents, and generating poems. We first use our model based on common word associations to solve Remote Associates tests, which measure creativity. We then produce creative word combinations on the basis of observations made about the model. These combinations contain words that are not necessarily related to one another but share some connecting concepts. Document summarization refers to the task of producing a summary of limited length from a longer document.
We present a method that produces summaries of competitive quality in different languages, using document-specific word associations and simple graph algorithms. Associations can also be used to produce document-specific poems. Poems inspired by documents could be used, for example, to comment on news articles. Our results on models based on both common and document-specific associations show that these models can be used effectively for different purposes. The result is a simple language model that can be used with many different languages.

    Corpus-Based Generation of Content and Form in Poetry

    We employ a corpus-based approach to generate content and form in poetry. The main idea is to use two different corpora: one to provide semantic content for new poems, and the other to generate a specific grammatical and poetic structure. The approach uses text mining methods, morphological analysis, and morphological synthesis to produce poetry in Finnish. We present some promising results obtained via the combination of these methods and preliminary evaluation results of poetry generated by the system. Peer reviewed.

    Lexical Creativity from Word Associations

    A fluent ability to associate tasks, concepts, ideas, knowledge and experiences in a relevant way is often considered an important factor of creativity, especially in problem solving. We are interested in providing computational support for discovering such creative associations. In this paper we design minimally supervised methods that can perform well in the remote associates test (RAT), a well-known psychometric measure of creativity. We show that with a large corpus of text and some relatively simple principles, this can be achieved. We then develop methods for a more general word association model that could be used in lexical creativity support systems, and which also could be a small step towards lexical creativity in computers. Peer reviewed.
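
In a remote associates test item, the solver is given a few cue words and must find one word related to all of them. The sketch below shows one simple way such an item could be approached from corpus co-occurrence counts. It is an assumption-laden illustration, not the paper's method: `solve_rat` is a hypothetical name, the input is assumed to be a dictionary of pair counts, and ranking by the weakest link to any cue is an arbitrary choice.

```python
def solve_rat(cues, pair_counts):
    """Pick the word linked to every cue, ranked by its weakest cue link.

    pair_counts maps (word_a, word_b) tuples to co-occurrence counts.
    Returns None when no word co-occurs with all cues.
    """
    # Build a symmetric neighbour table from the pair counts.
    neighbours = {}
    for (a, b), count in pair_counts.items():
        neighbours.setdefault(a, {})[b] = count
        neighbours.setdefault(b, {})[a] = count

    cue_set = set(cues)
    candidates = None
    for cue in cues:
        # Words seen with this cue, excluding the cues themselves.
        linked = set(neighbours.get(cue, {})) - cue_set
        candidates = linked if candidates is None else candidates & linked
    if not candidates:
        return None

    # Sorting first makes tie-breaking deterministic (alphabetical).
    return max(sorted(candidates),
               key=lambda w: min(neighbours[c][w] for c in cues))
```

For example, with toy counts in which "cheese" co-occurs often with each of "cottage", "swiss" and "cake", while "house" co-occurs only rarely with two of them, the function returns "cheese".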

    Arts, News, Poetry — The Art of Framing

    Software Newsroom – an approach to automation of news search and editing

    We have developed tools and applied methods for automated identification of potential news from textual data for an automated news search system called Software Newsroom. The purpose of the tools is to analyze data collected from the internet and to identify content that is likely to contain new information. The identified information is summarized in order to help in understanding the semantic contents of the data, and to assist the news editing process. It has been demonstrated that words with a certain set of syntactic and semantic properties are effective when building topic models for English. We demonstrate that words with the same properties in Finnish are useful as well. Extracting such words requires knowledge about the special characteristics of the Finnish language, which are taken into account in our analysis. Two different methodological approaches have been applied for the news search. The first method is based on topic analysis and applies Multinomial Principal Component Analysis (MPCA) for topic model creation and data profiling. The second method is based on word association analysis and applies the log-likelihood ratio (LLR). For the topic mining, we have created English and Finnish language corpora from Wikipedia and Finnish corpora from several Finnish news archives, and we have used bag-of-words representations of these corpora as training data for the topic model. We have performed topic analysis experiments with both the training data itself and with arbitrary text parsed from internet sources. The results suggest that the effectiveness of news search strongly depends on the quality of the training data and its linguistic analysis. In the association analysis, we use a combined methodology for detecting novel word associations in the text. To detect novel associations, we use a background corpus from which we extract common word associations.
In parallel, we collect the statistics of word co-occurrences from the documents of interest and search for associations with a higher likelihood in these documents than in the background. We have demonstrated the applicability of these methods for Software Newsroom. The results indicate that the background-foreground model has significant potential in news search. The experiments also indicate great promise in employing background-foreground word associations for other applications. A combined application of the two methods is planned, as well as the application of the methods to social media using a pre-translator of social media language. Peer reviewed.
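
The association analysis above names the log-likelihood ratio (LLR) as its statistic. As a rough illustration, the sketch below computes the standard log-likelihood ratio for a 2x2 contingency table and applies it to sentence-level word co-occurrence within a single corpus. The function names and the sentence-level counting are assumptions, and the sketch does not implement the background-foreground comparison described in the abstract.

```python
import math
from collections import Counter
from itertools import combinations


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: sentences with both words, k12: first word only,
    k21: second word only, k22: neither word.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # empty cells contribute nothing to the sum
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def association_scores(sentences):
    """Score every co-occurring word pair by LLR over sentence counts."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # count each word once per sentence
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = word_counts[a] - k11
        k21 = word_counts[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores
```

Note that this statistic measures deviation from independence in either direction, so a practical system would also need to check that a pair co-occurs more often than expected rather than less.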

    The D2-D6 System and a Fibered AdS Geometry

    The system of D2 branes localized on or near D6 branes is considered. The world-volume theory on the D2 branes is investigated, using its conjectured relation to the near-horizon geometry. The results are in agreement with known facts and expectations for the corresponding field theory, and a rich phase structure is obtained as a function of the energy scale and the number of branes. In particular, for an intermediate range of the number of D6 branes, the IR geometry is that of an AdS_4 space fibered over a compact space. This D2-D6 system is compared to other systems, related to it by compactification and duality, and it is shown that the qualitative differences have compatible explanations in the geometric and field-theoretic descriptions. Another system, that of NS5 branes located at D6 branes, is also briefly studied, leading to a similar phase structure.
Comment: 35 pages (LaTeX) and 2 figures (encapsulated PostScript). Ver2: added discussion of the relation to the system without D6 branes (in the introduction and in figure 1); added description of the geometrical realization of the R symmetries (in section 3.1

    Conceptual Representations for Computational Concept Creation

    Computational creativity seeks to understand computational mechanisms that can be characterized as creative. The creation of new concepts is a central challenge for any creative system. In this article, we outline different approaches to computational concept creation and then review conceptual representations relevant to concept creation, and therefore to computational creativity. The conceptual representations are organized in accordance with two important perspectives on the distinctions between them. One distinction is between symbolic, spatial and connectionist representations. The other is between descriptive and procedural representations. Additionally, conceptual representations used in particular creative domains, such as language, music, image and emotion, are reviewed separately. For every representation reviewed, we cover the inference it affords, the computational means of building it, and its application in concept creation. Peer reviewed.